How the Researcher Can Help the Reading Teacher
with Classroom Assessment

Robert C. Calfee 
Stanford University 

Priscilla A. Drum 
University of California, Santa Barbara

To many educators, tests seem an unavoidable nuisance. Though they are
useful to some people for certain purposes, their usefulness and
appropriateness are increasingly questioned. A rising chorus asks whether tests
really provide fair and useful measures of educational progress, and colleagues
caution against overuse of tests to no good purpose (e.g., Venezky, 1974a;
Levine, 1976).

The measurement tradition is strong in educational psychology. Tests are
one of the few "scientific" elements in educational research and practice, and
they can serve a vital role: evidence is essential to effective and efficient
instruction. For instance, there are definite limits to what a lecturer can
hope to achieve, because he obtains relatively little information from the
members of his audience; he must rely on eye contact, on signs of attentiveness,
and on questions from the listeners. At the other end of the continuum,
individualized instruction builds on the continual exchange of information between
teacher and student; the instructional program is continuously realigned to
the student's needs and strengths (e.g., Atkinson & Paulson, 1972). Frequent,
precise, and appropriate assessment is critical to this process. But such
testing must be designed to fit the instructional needs of the teacher; this
is the burden of our present message.

The testing tradition, following the lead of Alfred Binet, has focused
attention on the selection and sorting of individuals. One can find occasional
comment on "teacher-made" tests in books on educational testing. But even here,
the criteria are, by and large, those applied to tests for selection and sorting.
That other needs exist is reflected in the plethora of terms
(criterion-referenced, domain-referenced, behavioral-objectives, diagnostic), all
denoting something other than the conventional testing approach. There is little
evidence that the "new" tests do a different job from the old ones; it is also
worth noting that tests with quite different labels look quite similar.
Researchers can provide a service to teachers by looking systematically at the
needs for assessment in the classroom, and by analysis of the theoretical
and empirical issues in this area. The goal of the present paper is to
suggest to researchers some specific issues that warrant investigation; a
companion paper has been prepared to look at these issues from the teacher's
perspective (Calfee, Drum & Arnold, in press).

An Overview and Two Conclusions 
How can assessment be tailored to fit the needs of the classroom? To
answer this question, we need to consider three other questions:

(1) Assessment for what? (The goals)

(2) How to assess? (The methods)

(3) Is assessment doing its job? (The criteria)

We will consider in turn each question as it relates to reading, but
first, two major conclusions:

(1) Teachers need to learn more about the process of assessment in order
to assess for instructional purposes.

(2) Classroom assessment ought to aim toward the precise and efficient
measurement of specific component skills for short-term decisions.

The bulk of the paper will buttress and illustrate these two
generalizations, but some preliminary comments will help to set the stage. Much research
on reading assessment aims toward goals quite different from those that are
foremost for the classroom teacher. The goal of conventional achievement test
construction (and thus of the research that centers around such test
construction) is the measurement of reliable and substantial individual differences,
based on stable scores for each student which place him at some point below,
at, or above the average for some larger population. The aim is an instrument
for making major, long-term decisions about students, teachers, and programs
(Carver, 1974).

The teacher needs information of a much more immediate character. How
well can the student read now? What specific reading instruction should he
receive next? Is the instruction successful? To the question, "What does the
research literature on assessment have to say to the classroom teacher about
his instructional needs?", the answer is "Not much!"

To our knowledge, no existing assessment system handles the range of
assessment tasks encountered by the reading teacher. Most commercial tests
provide little evidence useful for instruction, and are too expensive in time
and effort for the teacher's needs. It is not that commercial tests are
faulty, but rather that they are designed for purposes other than immediate
instructional decisions.

Moreover, the researcher cannot focus attention solely on the
characteristics of the assessment system, if his goal is the proper assessment of
students in the multifaceted happenings of the classroom. The researcher must
plan investigations where variation in the characteristics of the assessment
system is only one set of factors; the design must also call for variation
in the teacher's background and training, in the makeup of the class, and in
the nature of the instructional program. Designs of this comprehensiveness
require more thoughtfulness than has been typical of educational research, but
they are technically feasible (Calfee, 1975a).

In fact, one can argue that research on classroom assessment should not
center on test construction at all, but rather on teacher training. Public
and private contractors will, one hopes, improve the kinds of assessment
systems available to the teacher. But we suspect that the key to adequate
assessment for instructional decision-making in the classroom is the classroom
teacher who knows how to select with care from what is available for other
purposes, and who can modify and simplify the materials at his disposal with an
eye to practical application. If so, the chief task of those who would improve
assessment of reading for purposes of instruction lies not in psychometrics,
but in improving teaching. This does not mean that all psychometric problems
have been solved; to the contrary. It simply means that psychometrics may not
be at stage center.

Goals of Assessment

What are alternative goals in assessment? First, certain goals aim
toward long-term prediction. This is true in evaluation of the individual
(Cronbach & Gleser, 1965), for job placement, for school admission, for a grade
or achievement mark of some sort. It is true when assessment serves for
evaluation of a program. The administrator has to decide whether a curriculum is
effective, whether a special program is better than the regular program,
whether extra money is making a difference. Diagnosis also falls in this
category. Diagnosis is for special cases, like physical anomalies. A person
who can't see well has trouble learning to read. If he can't hear very well,
he may also have trouble in school tasks, including reading. These are
special cases and may require a clinical specialist.

Other goals aim toward short-term decisions. Assessment can serve for
instructional decision-making by the classroom teacher. The instructor has
to stay current on what each student knows if instruction is to be precisely
directed toward specific needs. "Individualization" is the usual label for
this concept. Each student is assessed as to his present skills, abilities,
and knowledge, so he can be helped to move from where he is toward some
reasonable goal.

Tasks other than individualization also require the classroom teacher to
apply skills in assessment:

It is the beginning of the school year. The teacher is new
to the school, and wants to supplement information in the "cum"
folder with his own evidence.

A new student arrives in class at midyear, and there is little
information available on how well he can read.

The teacher is planning to introduce a new topic (e.g., how
to handle polysyllabic words), and needs to know which students
know something about the topic, and which ones are totally
unprepared.

In summary, assessment for short-term instructional decisions covers
diverse situations: (a) optimizing instructional sequences, (b) measuring
immediate response to instruction, (c) regrouping for instruction for
specific purposes, and (d) deciding on selection and allocation of resources
(who needs the aide's time, the tutor's time, the terminal's time?).

Present Methods of Assessment in Education 

Psychometrically "sound" tests in use today include normative and
criterion-referenced tests; these two types of tests differ little in
content, though designed for different applications (Green, 1976). A
norm-referenced test shows how the student's score or the class's score compares
with the other students or classes who provided the standards for the test.
A criterion-referenced test provides a score for a student or a class based
on the number of items mastered (answered correctly) compared with some
absolute standard. Neither type of score tells the teacher what a student knows
or does not know; direction for further instruction is not indicated. Both
types of tests are standardized; an exact procedure for administration is
called for, with little room for clinical probing. Most tests are
group-administered and use a multiple-choice format to facilitate machine scoring.
The content resembles "goulash"; though a subtest structure is often imposed
on the test items, the high intertest correlations belie the different names
assigned to subtests.

It can also be said of these tests that they are reliable, that the
student's relative standing is stable over time, and that they are highly
predictive of one another (Bloom, 1964). They are time consuming to administer;
they are generally not capable of repeated administration (two or three times
a year at most). They yield a single type of measure (percentage correct or
some transformation thereof).

Such tests have been developed to meet certain implicit and explicit
criteria. It therefore makes sense to consider the standards and criteria that
apply to the construction, administration, and interpretation of a test.

Criteria for Evaluating Tests
We want to examine briefly several criteria for evaluating tests:
reliability, validity, appropriateness, independence, discriminability, cost, and
repeatability. The first two are usually discussed in texts on testing, the
others generally not (e.g., Anastasi, 1968; Cronbach, 1970; Farr & Tuinman,
1972). Each criterion has several facets to it.

Reliability: Does the Instrument Provide a Consistent Measure?

In general, reliability refers to the degree to which a measurement is
consistent. We can consider the consistency in performance when a person is tested
with one form of a test and then retested with a slightly varied form. Several
things have changed. The exact form and content of the test have changed. The
student has probably changed. He may have learned something, he may have
forgotten something, he may have a headache now that he didn't have earlier. All
these sources of variability tend to reduce the reliability in test-retest
situations.

Test developers tend to emphasize within-test reliability. There are a
variety of ways of thinking about this form of consistency (Cronbach, 1970,
Ch. 6). For instance, suppose you divide the items at random in two and
correlate the two subscores. Repeat this operation for all possible split-half
divisions of the test, then compute the average correlation between the
half-scores (Cronbach, 1951). This provides a measure of the extent to which each
item contributes consistently to the total test score. One way to obtain
"perfect" intratest reliability is to use a test in which the student either
fails all items or passes all items. Test developers, to the degree that they
strive for intratest reliability, are under pressure to eliminate test items that
yield divergent patterns of performance from one student to the next. The items
that remain seem likely to measure general performance characteristics rather
than performances that reflect specific instructional outcomes. So if you want
a perfectly reliable test, ask the same question twenty times. Either a student
knows the answer or he doesn't. This would be absurd, of course, but in the
limit it is the "ideal" toward which reliability aims.
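
The split-half procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two score matrices are hypothetical (they are not data from this paper), the function names `pearson` and `mean_split_half` are invented for the example, the number of items is assumed to be even, and the plain half-score correlation is averaged without any correction for test length.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # undefined if either half-score is constant


def mean_split_half(scores):
    """Average half-score correlation over every equal split of the items."""
    n_items = len(scores[0])
    items = range(n_items)
    correlations = []
    for half in combinations(items, n_items // 2):
        if 0 not in half:
            continue  # keep item 0 in the first half so each split is counted once
        other = [i for i in items if i not in half]
        first = [sum(row[i] for i in half) for row in scores]
        second = [sum(row[i] for i in other) for row in scores]
        correlations.append(pearson(first, second))
    return sum(correlations) / len(correlations)


# Hypothetical data: rows are students, columns are items (1 = pass, 0 = fail).
graded = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0],
          [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0]]
all_or_none = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]

print(mean_split_half(graded))
print(mean_split_half(all_or_none))
```

In the all-or-none matrix the two halves always agree, so every split correlates perfectly; this is the limiting case the text calls the "ideal" of intratest reliability.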

Maximizing intratest reliability is important when the test score is to
serve for a major decision, but it may be counterproductive for instructional
decision-making. Teachers need to know more than the student's general ability.
Individualization requires knowledge of diverse patterns of performance on
specific tasks for different students. For the teacher, a "reliable"
assessment instrument is more properly defined as one which accurately and
consistently indicates the specific patterns of instruction that best fit the
student's needs and capabilities. We shall examine this matter in more detail
later in this paper.

Validity: Does the Instrument Measure What It's Supposed to?

As with reliability, the concept of validity assumes many guises. Face
validity means that the test looks like it measures what ought to be tested.
Construct validity means that if several tests seem to be measuring the same
thing, there must be something there to be measured. Predictive validity means
that there is a correlation between a test and a criterion of performance
(usually another test).

To possess adequate validity for most educational purposes, a test usually
has to satisfy each of these criteria. For instance, one can predict reading
achievement reasonably well from mathematics tests, but teachers and parents
would question the face validity; it would not be seemly to measure reading
performance with a test containing arithmetic "word problems," even if the test
met the usual standards of reliability and predictive validity. The researcher
could provide a service by exploring the issue of instructional validity: a test
is valid when it points to an instructional treatment that improves the student's
performance on a specified task. From this point of view, aptitude-treatment
interaction research aims to validate various aptitude tests (Cronbach & Snow,
in press; Walker & Schaffarzick, 1974).

Appropriateness: Does the Instrument Measure Sensibly, Given the Use to Which
the Evidence Is to Be Put?

Appropriateness is introduced here as a fuzzy concept covering several
related matters. In part, it has to do with whether a test is linked to the goals
of an instructional program with sufficient directness and breadth. Researchers
learn the meaning of this concept when public school teachers ask why no one
tests what they teach. This complaint is fair and deserves the attention of
evaluators.

Appropriateness is disregarded in the common practice of assigning a
student to a particular level of a test according to his age or nominal grade
placement rather than his actual performance level. The experience of the
Chicago schools when they selected achievement tests to be appropriate to the
students' actual reading level is instructive in this connection (Chicago
Tribune, 1975). Asking a high school student who reads at the first or second
grade level to handle an advanced level of the Metropolitan Achievement Test is
a mistake; whatever the score, it is unlikely to reveal the student's actual skill
in reading. The advanced test is for those reading at grade six or above.
Students reading at a lower level are likely to guess randomly at the answers,
but this performance is likely to lead to a grade level score that is higher
than their actual reading ability.
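
The arithmetic behind the guessing problem can be made concrete with a small sketch. The numbers here are assumptions chosen for illustration (a hypothetical test of 40 four-option items, not the actual format of any named test), and the function name `chance_score` is invented. The sketch shows only the raw score expected from blind guessing; converting that raw score to a grade level depends on a publisher's norm table, which is not reproduced here.

```python
from math import sqrt


def chance_score(n_items, n_options):
    """Mean and standard deviation of the raw score under blind guessing."""
    p = 1.0 / n_options                    # chance of guessing one item right
    mean = n_items * p                     # binomial mean
    sd = sqrt(n_items * p * (1.0 - p))     # binomial standard deviation
    return mean, sd


# Hypothetical test: 40 four-option items.
mean, sd = chance_score(n_items=40, n_options=4)
print(f"expected raw score from guessing: {mean:.1f} (sd {sd:.2f})")
```

A student who cannot read the material at all still earns a nonzero raw score on average, and on a test normed for advanced readers even that chance-level score is assigned some grade-level equivalent.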

Finally, appropriateness seems to distinguish many conventional academic
achievement tests from the alternatives represented in the National Assessment
of Educational Progress. The goal of NAEP was to cover the range of reading
tasks that a literate person might confront in his experiences in school, at
work, at play, and in the other aspects of life in the society. The typical
comprehension test is simply not appropriate to cover the broad array of
"themes" that seemed important to the NAEP staff:

1. understanding words and word relationships (literal comprehension
of isolated words, phrases, and sentences);

2. graphic materials (comprehension of the linguistic components of
drawings, signs, labels, charts, maps, graphs, and forms);



Calfee/Drum    Researcher Helps Reading Teacher    10/76



3. written directions (comprehension of directions, plus ability to
carry them out operationally);

4. reference materials (comprehension and knowledge of indices,
dictionaries, alphabetizing, and TV listing formats);

5. gleaning significant facts from passages (comprehension, and to a
limited extent, recall, of literal content in the context of a
larger reading passage);

6. main ideas and organization (ability to abstract upwards from the
sentence-by-sentence content of a passage and recognize main ideas
and organizational features);

7. drawing inferences (ability to reach a conclusion not explicitly
stated in the passage, in most instances relying only on
information given but in a few cases on knowledge unrelated to the passage);

8. critical reading (ability to recognize author's purpose, and to
understand figurative language and literary devices) (Mellon, 1975).

It is also the point of the research of Sticht and his colleagues (Sticht,
1975; Sticht, Caylor, Kern, & Fox, 1971) that the assessment of a person's
reading ability (and the preparation of what he is expected to read) should be
appropriate to the task demands: don't make life unnecessarily difficult by
asking hard, tricky questions when easy, plain ones will do.

Independence: If Several Skills Are Measured, Is There Evidence That They Are
More or Less Separable and Autonomous, Not Closely Correlated?

To be most useful, the several scores from an assessment battery should
provide the teacher with distinctive pieces of information. When all the
subtest scores are highly intercorrelated, the teacher receives little guidance
about distinctive courses of action. As Thorndike (1973) has pointed out,
even a modest degree of correlation between two scores (r = .6 or more) makes
it difficult to make differential diagnosis, given that the scores are
normally distributed. The magnitude of this problem for certain commercial tests
has been discussed by Calfee and Venezky (1969), and possible remedies
suggested (Calfee, in press). One desirable condition is that each test be "clean,"
i.e., that steps be taken to insure that the test measures the desired skill
and none other. We will describe later a second approach built upon factorial
test design, in which systematic variation in the materials and conditions of
testing allows the tester to find out the circumstances under which a student
can and cannot handle a task.
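
One way to see why correlated subtests hamper differential diagnosis is the classical-test-theory formula for the reliability of a difference between two standardized scores with equal variances. The sketch below uses that textbook formula as an assumption; it is not necessarily the argument Thorndike (1973) made, and the subtest reliabilities of .80 and the function name `difference_reliability` are hypothetical.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of X - Y for two standardized scores with equal variances."""
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical subtests, each with reliability .80.
for r_xy in (0.0, 0.3, 0.6):
    print(r_xy, round(difference_reliability(0.80, 0.80, r_xy), 2))
```

As the correlation between the two subtests rises toward their own reliabilities, the difference score, which is what a differential diagnosis rests on, becomes steadily less dependable.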

Discriminability: When Possible, Information from a Subtest Should Be "Yes-No."

It takes more expertise and attention to monitor an ammeter and make
decisions about an automobile's electric system than to notice simply whether
the generator light is on or off. Similarly with a test: when the scores on
a test take the form of a normal distribution, then fine gradations in
performance matter a lot and interpretation is more difficult. It is much easier
to interpret performance when it is either clearly at the mastery level or
altogether faulty, with no "in between" scores. Careful specification of the task
is required, but the benefits for instructional decision-making can be
considerable (Calfee, in press).

Cost: How Much Time and Money to Buy, Administer, Score, and Interpret?

Tests cost money, and they cost time. These costs may be overlooked by
teachers, even when they are the ones who pay. For instance, in one school,
teachers spent three days testing the students' reading skills in third and
fourth grades. The scores were then used for the sole purpose of sorting
students into three reading groups: high, medium, and low. Obtaining a
ten-minute oral reading sample from each student would probably have done the
sorting job as well, or better, and at much less cost. When a major decision is
to be made, substantial cost is justified; when continuous short-term decisions
are required, low cost is essential.

Repeatability: For Classroom Instruction, It Should Be Practical to
Readminister a Test Whenever the Teacher Needs Information.

The time and cost required by many tests make repeated administration
impractical. Besides this, the psychometric concern with reactivity in
retesting leads to advice against repeated administrations of the same form.
It is rare to find more than two alternate forms of most commercial tests.
For evaluation of a program or an individual, assessment once or twice a year
is sufficient. But the teacher who wants evidence on the effectiveness of
yesterday's instruction needs an "off-the-shelf" test, one which comes in many
forms, and can be used as often as necessary.

A Closer Look at Reliability 
If any concept is central to research on assessment, reliability certainly 
seems the candidate. As noted above, in its simplest form reliability means
that a measure is consistent and reproducible. Suppose, when a carpenter used
his ruler to measure the length of a board, that each "inch" on the ruler acted
somewhat differently during the measurement process. Then the results of the
measurement would vary depending on which particular ruler was used and the
length of what was being measured, among other things. This is manifestly
undesirable. By analogy, the designer of a test for the measurement of academic
outcomes seeks to build a test from a set of items that act together
consistently to measure the skill or knowledge of interest. Indices of intratest
reliability such as split-half reliability, the point-biserial coefficient,
alpha, or the KR-20 index reveal the extent to which performance on each item
in a test contributes in a consistent fashion to the total score.

Another way of thinking about reliability builds on the analysis of
variance procedure (Cronbach, 1970, pp. 155ff). For instance, consider the scores
for five students in Table 1. These records show a fair amount of consistency.
Students may do well or poorly, but each item contributes consistently to the
total score. Item 4 is harder than the other items, and the students who do
most poorly always do poorly on this item. Similarly, Item 1 is relatively
easy, and consistently so for the students who do best.

Insert Table 1 About Here 

The magnitude of the consistency can be determined through the standard
analysis of variance (refer to Cronbach, 1970, p. 159, for details of the
procedure). The total variance in the scores can be partitioned to yield three
variance estimates (Table 2). The expected value of each variance estimate
allows one to compute the variance component for each source, as shown beneath
the analysis of variance summary table. Thus, the variance of the students'
"true" scores is estimated to be $\sigma_S^2 = .487$; the variance in the
student-item interactions, $\sigma_{SI}^2$, is estimated to be .113. The student
total-score variance is a measure of individual differences in the total scores.
The student-item interaction is a measure of inconsistencies in the way different
students react to different items. In this example, the idiosyncratic variation
in items is relatively slight, compared with total score variance. As an index of
the consistency of the contribution of individual items to the total score,
Cronbach (1951) proposed the ratio of true score to observed score variances.
This is equivalent to the ratio between total score variance and overall variance
(total score variance plus idiosyncratic student-item variance):

\[ \alpha = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_{SI}^2} \]

The principle here is quite simple: to take seriously the student's total score
as an index of individual difference, variation in the set of "true" scores
should account for a fairly large proportion of the overall variance in the
observed scores, which can be shown to be the sum of $\sigma_S^2$ and
$\sigma_{SI}^2$. As can be seen at the bottom of the table, the Cronbach alpha
for these data, $\alpha = .81$, is quite high, given the limited data.
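
The computation just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Tables 1 and 2 (which are not included in this text): the 5-by-4 score matrix is hypothetical, the function names are invented, and the variance components are taken from the usual expected mean squares of a students-by-items analysis of variance, which need not agree with the biased estimates the paper reports. Only the two component values .487 and .113 come from the text.

```python
def variance_components(scores):
    """Student and student-item variance components from a two-way ANOVA."""
    n_s, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_i)
    student_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_s for j in range(n_i)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_students = n_i * sum((m - grand) ** 2 for m in student_means)
    ss_items = n_s * sum((m - grand) ** 2 for m in item_means)
    ss_si = ss_total - ss_students - ss_items      # student-item interaction
    ms_students = ss_students / (n_s - 1)
    ms_si = ss_si / ((n_s - 1) * (n_i - 1))
    var_s = (ms_students - ms_si) / n_i            # from expected mean squares
    return var_s, ms_si


def item_alpha(var_s, var_si):
    """Ratio of student variance to student plus student-item variance."""
    return var_s / (var_s + var_si)


# Variance components quoted in the text for Table 2.
print(round(item_alpha(0.487, 0.113), 2))

# Hypothetical 5-student, 4-item matrix (1 = pass, 0 = fail).
scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
var_s, var_si = variance_components(scores)
print(var_s, var_si, item_alpha(var_s, var_si))
```

Plugging the two components quoted in the text into the ratio recovers the reported value of .81 after rounding.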

Insert Table 2 About Here 

Incidentally, what is estimated as a reliability in this example, and
throughout the discussion that follows, is what Cronbach (1951) describes as the
consistency of the contribution of the individual item to a summary score of
some kind. One can also calculate the overall reliability of the total score
of a test or subtest, but for our present purposes it is item reliability that
is most important. It should also be mentioned that the estimates of $\alpha$ in
this discussion are biased; the procedure for calculating unbiased estimates is
straightforward (Winer, 1971, p. 282), but would unnecessarily complicate the
example. Finally, no effort is made to apply the Spearman-Brown correction
for test length.

As an example of an inconsistent set of items, consider the student-item
matrix in Table 3. The variation in the total scores of individual students is
exactly as in Table 1, but if you examine the data closely, you will see that
the items are less consistent. Items 1 and 2 are passed by some of the students
whose total score shows many errors; the same items are failed by some of the
students whose total score shows many successes. These idiosyncratic reactions
of particular students to particular items in an unpredictable and inconsistent
manner are referred to as subject-item interactions. The estimate of
student-item variance is indeed higher for this matrix (MS(SI) = .200), and the
reliability is .67, or 20 percent less than the results in Table 2.

Insert Table 3 About Here 


What are the characteristics of a test with a high reliability coefficient?
First, there must be individual differences of substantial magnitude in
overall performance. This is another way of saying that $\sigma_S^2$ must be
relatively large. Second, idiosyncratic reactions to particular items by
individual students must be small; put otherwise, $\sigma_{SI}^2$ must be
relatively small. Items that do not fall
into line are relatively easy to detect, and the dependability of the student's
total score is markedly improved by eliminating those items that do not fall
into line. For instance, if Items 1 and 2 are eliminated from the test in
Table 3, the test becomes perfectly reliable.

Suppose, however, that the purpose of the test is not to generate a single
total score, but to yield patterns of performance, which might serve usefully
for specific instructional responses. We will show now that the conventional
approach emphasizing total-score reliability can lead to the elimination of
the items that provide the essential information about such patterns. However,
extensions of the same basic procedure for determining reliability can be used
to evaluate the dependability of those patterns that do exist in the data.
These extensions build upon the landmark work of Cronbach and his colleagues on
generalizability theory for psychological assessment (Cronbach, 1951; Cronbach,
Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963; for a
different perspective on a similar problem, see Calfee, 1976; Calfee & Elman,
in press).

The key to the evaluation of patterns of individual differences is to
think about the reliability of the patterns, rather than the reliability of the
total test score or of a particular subtest score. The analysis of variance
technique provides the technology to support this thinking, which is why we
introduced it earlier. The concepts will be introduced with the aid of a
specific example, the student-item matrix in Table 4 (disregard the subtests
for now). Suppose a teacher has developed an eight-item test, and has collected
scores from twenty students. He gives you, the researcher, the data and asks
that the reliability of the test be determined, to insure that the instrument
meets customary standards. We will now proceed to examine these data in some
detail. At first glance the test will appear relatively unreliable. However,
closer attention to the structure of the data (a process much like peeling an
onion) will uncover a great deal of reliable information. The analysis of
variance will provide a systematic accounting of the information, and at each
stage of analysis we will see that reliability coefficients of increasing
specificity will be determined.

Insert Table 4 About Here 



Casual examination of the matrix in Table 4 shows you that, while there
are substantial individual differences in the total scores, there is also
considerable idiosyncratic variation in the reaction of particular students to
particular items. The situation is not too bad, as can be seen from the
analysis in Table 5. The reliability measured by $\alpha$ is of a respectable
magnitude by many standards, especially when you remember that the $\alpha$ value
in Table 5 is the reliability of a single item. The $\alpha$ value for the total
score can be shown to be .93, for instance (Winer, 1971, pp. 286-287).



Insert Table 5 About Here 



The test designer then remarks to you that the test actually comprises
items from two distinctive categories, and he is curious about whether the two
subtests reveal the differences in performance they were designed to measure.
In Table 5 the items can be arranged according to a subtest structure. If you
look at the first four and last four items for each student, you can see more
consistent patterns within each subtest than appear when the test is examined
as a whole. Each student tends to succeed or to fail on all the items within
a subtest; there is only modest deviation from the all-or-none pattern. This
suggests that the performance patterns are more reliable than the overall
measure in Table 5 suggests.

In Table 6 is shown the determination of reliabilities for this situation.
The analysis of variance now includes the Test factor as a source of variance,
along with the Student-Test interaction. Two reliability indices can be
computed, in answer to the questions:

o  How consistent is the contribution of the subtest score to the
   total score? The answer is, only slightly so, α = .162. (The
   subtest is the "item" in this analysis.) Look at the data and you
   will see that some students have a high score on T1, some a low
   score; some have a high score on T2, some a low score; and all
   combinations of high and low on each subtest are represented. In
   other words, there are substantial student-subtest interactions.
o  How consistent is the contribution of each item within a subtest
   to the difference between the student's subtest scores? The
   consistency here shows up in this reliability coefficient, α = .836.
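
The contrast between the two questions can be sketched with an artificial, perfectly all-or-none data set. The code below is an illustration under stated assumptions, not the analysis of variance in Table 6: it uses coefficient alpha as a stand-in consistency index, the data are hypothetical, and the second value shows only item consistency within one subtest rather than the difference-score reliability described above.

```python
from statistics import variance


def coefficient_alpha(parts):
    """Coefficient alpha for a list of parts (one list of scores per part)."""
    k = len(parts)
    totals = [sum(vals) for vals in zip(*parts)]
    part_var = sum(variance(p) for p in parts)
    return k / (k - 1) * (1 - part_var / variance(totals))


# Hypothetical all-or-none data for four students, four items per subtest.
# Rows are students; the first four columns are subtest T1, the last four T2.
MATRIX = [
    [1, 1, 1, 1, 1, 1, 1, 1],   # high on T1, high on T2
    [1, 1, 1, 1, 0, 0, 0, 0],   # high on T1, low on T2
    [0, 0, 0, 0, 1, 1, 1, 1],   # low on T1, high on T2
    [0, 0, 0, 0, 0, 0, 0, 0],   # low on T1, low on T2
]
columns = [list(col) for col in zip(*MATRIX)]
t1_items, t2_items = columns[:4], columns[4:]
t1_scores = [sum(v) for v in zip(*t1_items)]
t2_scores = [sum(v) for v in zip(*t2_items)]

# Subtests treated as the "items": all four high/low combinations occur,
# so the subtests contribute inconsistently to the total score.
alpha_subtests = coefficient_alpha([t1_scores, t2_scores])

# Items within one subtest: each student passes or fails all of them.
alpha_within_t1 = coefficient_alpha(t1_items)

print(f"subtests as items: {alpha_subtests:.3f}")
print(f"items within T1:   {alpha_within_t1:.3f}")
```

Because the artificial data are an extreme case, the two indices sit at opposite ends of the scale; real data with modest deviations from the all-or-none pattern would fall in between.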

Insert Table 6 About Here 



The increase in the last mentioned reliability coefficient compared with
the total-test coefficient in Table 5 seems modest, only about 10 percent.
Yet there is a substantial gain in our understanding of the test structure: we
can see that individuals differ considerably in the subtest patterns, whereas
the total test score is not reliable compared with variations in
subtest-student interactions.

What does the preceding analysis of reliabilities tell the test designer
in this particular instance? The overall reliability of the test (Table 5) is
moderate, but not spectacular. From this analysis alone, the test designer
might be advised to throw away some of the items that contribute least
consistently to the overall test score. This would be a mistake, because these
same items contribute most consistently to the subtest patterns. The subtest
analysis (Table 6) reveals that the subtests themselves contribute
inconsistently to the total test score, but the items within each subtest yield
fairly consistent patterns of individual differences in the subtest scores.
These patterns are readily visible to the naked eye. To be sure, we created the
data set, and so we knew what the underlying structure really was. But there is
an important moral: it behooves the test designer to think seriously about the
dimensions of the test, and of the characteristics of the students for whom the
test is being designed (Calfee, 1976).

We have illustrated how the researcher can help the teacher in the conduct
of classroom assessment, using one of the oldest tools of the educational
psychologist's trade: the analysis of test reliability. To be sure, more is
needed than the examination of reliability of a total score. The tools exist
today for the investigation of the reliability of structural patterns, and it
is these that are likely to be of service to the classroom teacher.
Incidentally, the payoff from structural analysis increases with the complexity
of the structure. The test in the example above had the simplest possible
structure: two subtests. As the number of independent dimensions of pattern
increases, and as the number of student groups for which these are useful
dimensions increases, it becomes more important that the researcher turn away
from simple "omnibus" reliability to the more precise investigation of
structural reliabilities.

The Instructional Validity of Simple Decisions
After reliability, the second cornerstone of test theory is validity. We
want to consider here some ideas about the validity of decisions based on test
results, where a major consideration is the simplicity of the decision. A
decision in this context is a prediction: based on the evidence, the student
is likely to succeed if the situation remains as is, or the student is likely
to fail unless something out of the ordinary is attempted. One could also
inquire whether the test points with accuracy toward a specific instructional
treatment, but we will not deal with that issue here.

The usual approach to prediction in educational settings is the venerable
Pearson correlation coefficient. It assumes that two normally distributed
covariates share some common variance in the form of a linear relation. This
solution is elegant and most teachers learn something about correlation during
their preservice training.

The technique is straightforward. If we know (a) a student's score on
the predictor test, A, (b) the mean and variance of A, (c) the correlation
between A and the criterion or to-be-predicted test, B, and (d) the mean and
variance of B, then we can readily compute an estimate of the student's
probable performance on B, along with confidence bounds on the estimate. This
procedure assumes normality of the distribution of scores.
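
The computation just described can be sketched in a few lines. The sketch below is a minimal Python illustration of the standard linear regression estimate, assuming the bounds are formed from the standard error of estimate and a normal deviate; it ignores sampling error in the means, variances, and correlation. The function name and all summary statistics are hypothetical.

```python
import math


def predict_criterion(a, mean_a, sd_a, mean_b, sd_b, r, z=1.96):
    """Predict the criterion score B from predictor score A.

    Returns (estimate, lower, upper) using the linear regression of B on A
    and the standard error of estimate sd_b * sqrt(1 - r**2).
    z = 1.96 gives approximate 95 percent bounds under normality.
    """
    estimate = mean_b + r * (sd_b / sd_a) * (a - mean_a)
    se = sd_b * math.sqrt(1.0 - r ** 2)
    return estimate, estimate - z * se, estimate + z * se


# Hypothetical summary statistics; only the form of the computation matters.
est, low, high = predict_criterion(a=30, mean_a=40, sd_a=10,
                                   mean_b=50, sd_b=8, r=0.70)
print(f"predicted B = {est:.1f}, bounds {low:.1f} to {high:.1f}")
```

Even with a correlation of .70, the bounds around a single student's predicted score remain wide relative to the spread of B.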

Teachers seldom make use of the procedure just described. They are not
comfortable with statistics; they have neither the time, the information, nor
the computational formula. Thus, knowing that .70 is the correlation between
a child's score on a readiness test at the beginning of kindergarten and his
first-grade reading achievement is little help to the typical classroom teacher,
no matter how dedicated he might be. Of even less help are predictive
relations established by more sophisticated techniques, such as step-wise
multiple regression, discriminant analysis, factor analysis, or the like.

In our research we have explored some alternative approaches to prediction
based on all-or-none tests, with interesting consequences. The general
technique is most conveniently presented by a concrete example. A
kindergartner's knowledge of the names of the letters of the alphabet is known
to be predictive of subsequent performance on reading achievement tests (see,
for example, Gibson & Levin, 1975; Venezky, 1975). The reasons for this
relation are complex, and undoubtedly have more to do with home environment,
general ability, amount of time spent watching Sesame Street, and so on, than
with specific training on letter names. Alphabet knowledge is an indicator, not
a cause, of reading success or failure.

The technique works as follows: early in the school year ask a group of
kindergartners to name each letter of the alphabet; this yields the predictor
score. What shall we predict? Suppose we measure reading achievement of these
children two years later when they leave the first grade. Divide the students
into two groups: those who read at or above grade level and those who are
below grade level. The former group has "succeeded" by conventional standards.
The children in the latter group are below an acceptable level of performance,
and might have profited from additional instruction during kindergarten and
first grade. In any event, we have a simple metric to be predicted: success or
failure.

Now for the validation. How well can the kindergarten teacher sort
children into those who will probably succeed and those who probably need
additional help, using the child's knowledge of letter names? What is the
decision rule for sorting; how complicated does it have to be; how accurate
will it be?

We have some data on this question. Kindergarten children were tested
in 1970 on their ability to name each of the twenty-six upper-case English
letters (Calfee, in press). Two years later, at the end of first grade, they
took the Cooperative Primary Reading Test (Educational Testing Service, 1970).
We obtained complete records for 144 children from the original sample of 276.
There is a marked relation between alphabet knowledge and reading achievement
in this group of students; the correlation is .50.

An interesting pattern appears if we examine the frequency distribution
of alphabet scores (Figure 1). First, the distribution for the entire sample
is markedly bimodal (top panel). Second, children who are below grade level
at the end of the first grade are disproportionately represented at the lower
end of the distribution (they did not know their ABCs at the beginning of
kindergarten), whereas the children who were above grade level are
disproportionately represented at the upper end (they did know their ABCs). The
correlation describes accurately the linear relation between the two variables,
but it does not reveal the bimodality of the distributions and the potential
for simple decision-making inherent in that bimodality.



Insert Figure 1 About Here

In particular, suppose we sort children into two groups by a "cut-point"
on the alphabet knowledge distribution; we might classify as "in need of
additional instruction" all children who identified ten or fewer letters. Then
12 of the 61 children who were at or above grade level would have been
misclassified as needing additional instruction (they knew ten or fewer letters
when they entered kindergarten, but met the grade level criterion at the end of
the first grade); 28 of the 84 children who were below grade level would have
been misclassified as not needing additional instruction (they knew more than
ten letters on entry to first grade, but failed to meet the grade level
criterion). This means that by placing a cut-point at ten or fewer letters
correctly identified, 12 out of the total 144 students, or 8 percent, would be
misclassified as needing instruction when they would end up doing all right
without it, and 28 out of 144, or 19 percent, would be misclassified as not
needing instruction, but would end up below criterion. The total
misclassification rate would thus be 27 percent at this cut-point.
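
The percentages above can be checked directly from the counts given in the text. The short Python sketch below uses only the three reported numbers (12, 28, and 144); the variable and function names are illustrative.

```python
total = 144            # children with complete records
false_alarms = 12      # classified as needing help, but met criterion
misses = 28            # classified as not needing help, but fell below


def pct(count, n):
    """Express count as a percentage of n."""
    return 100.0 * count / n


fa_pct = pct(false_alarms, total)
miss_pct = pct(misses, total)
total_pct = pct(false_alarms + misses, total)
print(f"{fa_pct:.1f}  {miss_pct:.1f}  {total_pct:.1f}")
```

The reported 27 percent is the sum of the two rounded component percentages (8 and 19); computed from the unrounded counts, 40 of 144 is nearer 28 percent. The difference does not affect the argument.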

Figure 2 shows what happens as the cut-point is moved from the lowest to
the highest alphabet score for this set of data. If the cut-point is at the
extreme left of the abscissa, then even if a child cannot identify a single
letter, he is given no supplementary instruction. All of the children who fail
to reach criterion are misclassified under this condition; none of the children
who meet criterion are misclassified, of course, since by definition they need
no additional help. As the cut-point is moved to the right, more and more
students are assigned to supplementary instruction. At first, most are from the
below-criterion subgroup. There is a wide flat spot in the misclassification
function, reflecting the small number of students in the middle portion of the
bimodal distribution of alphabet knowledge scores. At a cut-point (or critical
value) of 10 in the figure, the percentages mentioned above can be seen: 8
percent of the students are falsely classified as needing more help, and 19
percent are students who need help but are not so classified, for a cumulative
misclassification rate of 27 percent (the sum of the previous two percentages).
Eventually, at the right-most side of the abscissa, all students receive
supplementary instruction, even those who know all the letter names. This means
that all of the above-criterion students are, by definition, misclassified.
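
The sweep of the cut-point described above can be sketched as follows. This is a minimal Python illustration, not the analysis behind Figure 2: the ten-child sample is hypothetical, and the function name and record layout are assumptions.

```python
def misclassification_curve(records, max_score=26):
    """records: list of (letters_named, met_criterion) pairs.

    For each cut-point c, a child is assigned to extra instruction when
    letters_named <= c; c = -1 means nobody is assigned. Returns a list of
    (c, false_alarm_rate, miss_rate, total_rate), each rate expressed as a
    proportion of all children.
    """
    n = len(records)
    curve = []
    for cut in range(-1, max_score + 1):
        false_alarms = sum(1 for s, ok in records if s <= cut and ok)
        misses = sum(1 for s, ok in records if s > cut and not ok)
        curve.append((cut, false_alarms / n, misses / n,
                      (false_alarms + misses) / n))
    return curve


# Hypothetical bimodal sample of ten children (not the study data).
SAMPLE = [(0, False), (2, False), (3, False), (5, True), (8, False),
          (20, False), (22, True), (24, True), (25, True), (26, True)]
curve = misclassification_curve(SAMPLE)
best = min(curve, key=lambda row: row[3])
print("lowest total misclassification at cut-point", best[0])
```

At the left end every below-criterion child is a miss; at the right end every above-criterion child is a false alarm; the total rate is lowest somewhere between.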

Insert Figure 2 About Here

Let us emphasize two features of this procedure. First, it is simple.
We can say to the teacher: "Give the child a test. If he makes more than X
successes, he's probably (this can be made more precise) going to do all right.
If he makes X successes or less, then he's probably going to be in trouble and
you had better think about what might be done to prevent failure." There are
no complicated statistics.

Second, it is robustly accurate. The total misclassification rate in
Figure 2 drops to a low of 25 percent, and stays at that level over a broad
range of cut-points. (Incidentally, Feshbach, Adelman, & Fuller, 1973, using
a predictive test battery, or teacher judgment, or both, found that the
misclassification rate from their measures and procedures ranged around 25
percent for a sample of almost 600 students.)

It should be stressed that nothing in the present analysis of alphabet
knowledge scores and reading achievement is implied as to the most appropriate
actions for a child in need. This is clearly not a precise test that calls for
a specific treatment. It is probably acting as a general indicator of a variety
of abilities and skills; the instructional response can be only a general one.

Standards for Practical Classroom Assessment

A cursory examination of the research literature reveals the emphasis on
tests suitable to long-term, major decisions (e.g., Weintraub et al., 1974,
pp. 460-464; 1973, pp. 429-447). The teacher's need for in-class assessment,
on the other hand, is best met by tests that are speedy, precise, clearly
"appropriate," and flexibly repeatable. The concepts of reliability and
validity need to be defined in unconventional ways to serve in the design of
tests for instructional decision-making.

The teacher cannot expect to find on-the-shelf tests that are well suited
to short-term instructional decisions. Moreover, training on "test
construction" reflects the conventional psychometric tradition, and so the
teacher is likely to be poorly prepared to select, to adapt, and to create
useful instruments. It is not the intention of this paper to go into detail
about the program of teacher training that might alleviate this gap. However,
we suspect that it would center about an analytic approach to "what is being
taught"; we have referred elsewhere to the distinction between a "jello" model
of the mind in contrast with the "works in a drawer" model, the former being
more Gestalt-like, the latter more analytic and information-processing in
character (Calfee & Floyd, 1973). Although the literature on teaching
effectiveness needs to be approached with caution, one can find consistent
signs to support the notion that the analytic-minded teacher is more effective
in promoting academic growth (Potter, 1975; Rosenshine & Furst, 1971). Another
instance comes from the work of Evertson and Brophy (1973): "The teacher who is
well organized, who monitors the class regularly and nips potentially serious
problems in the bud, and who has well established routines for handling
everyday procedural matters tends to be more successful in producing learning
gains." This sounds to us like a description of a highly analytic teacher.

Next, we want to highlight three desirable characteristics of tests to be
used for short-term instructional decisions:

1. The individual test needs to be "clean," in the sense that demands
on the student extraneous to the skill being measured are kept to
a minimum. The results from a clean test are much easier to interpret
than those from a test where many factors enter in an uncontrolled
fashion.

2. Rather than being rigorously standardized, the testing system
should permit clinical probing. Such variations in the testing
procedure need not be random. We have proposed factorial test
designs as a method for systematic exploration of the student's
ability to handle a task.

3. Tests for instructional decision-making require more attention to
breadth than precision (Cronbach [1970] refers to these as "bandwidth"
and "fidelity," respectively). Achieving this goal requires
attention to efficiency in the testing procedure, and especially
in the choice of where to begin testing for a student.

Each of these issues (clean tests, factorial test design, and efficient
entry testing) is a complex matter. We cannot do more below than emphasize
a few of the main points.

Clean Tests

A clean test is one in which a single well defined component is examined
(Calfee, in press; Calfee, Chapman, & Venezky, 1972). The test begins as simply
as possible; ideally, no student should make a mistake under the simplest
conditions. This shows the student understands the nature of the test and can
handle the general test-taking requirements. Then the difficulty of the test
is increased systematically. As errors occur they indicate the nature of the
student's problem. Developing a clean test often requires working backwards,
asking the question, "What must the student know to be able to succeed in this
task?" In answer to the question, "What does failure mean?" the teacher must
make a guess. Based on the guess, the teacher decides how to simplify the test.
If the guess was correct and the student is now successful, his problem has been
isolated. If he still makes mistakes, the guessing-testing process is pursued
further.

The major barriers to a clean test are often the general test requirements.
To do well on a test, the student must understand what is expected of him, and
must feel encouraged and motivated to do well. Listening carefully and
following instructions are important for success, and some students are better
at these general skills than are others. Individual or small-group testing
makes it easier for the teacher to assure that all students know what they are
to do, and makes it more likely that performance will reflect specific rather
than general skills. The clinical tester receives the training needed to gain
understanding; the classroom teacher may not have had any such training, but
he can be aided by guidelines for determining readiness for a test, and
suggestions about how to promote readiness.

Factorial Tests

Complementing the notion of a clean test is the idea of factorial test
structure. The clean test approach aims toward constancy in all dimensions of
the test except one; the factorial approach aims toward systematic variation
in several dimensions of the test. Because the concept is new, we will
illustrate in Figure 3 how a factorial structure provides a framework for the
instructor to think about in testing reading comprehension. One dimension is
the nature of the task: oral reading, silent reading with no time pressure,
and silent reading with time pressure. As a student becomes competent he should
be able to perform well and equally so under all these conditions. A second
dimension is the "question mode." How shall the teacher request information
from the student after he has finished reading? Perhaps the simplest approach
is to ask him direct, literal questions; these can be quite specific or may
allow for a more general response to the passage. A recognition test is
slightly more difficult, because the student has to read the question and the
alternatives, but at least the answers are provided to him. Production and
essay tasks demand even more from the student. To summarize a story requires
some sophistication, and failure can be traced to any of several possibilities.
If performance has been measured under simpler conditions, most of these
possibilities can be evaluated. Variation in materials is the third major
dimension. It makes quite a difference whether the student is reading a
familiar or an unfamiliar topic; difficulty level of vocabulary also makes a
difference.

Insert Figure 3 About Here 



Envision each student's performance in the multi-dimensional space of
Figure 3. The task of the instructor is to locate the student in this space,
in the sense that the instructor knows whether the student can perform
accurately and quickly in each cell. In fact, one might conceive of testing
that aims to trace through the three-dimensional space a line that represents
the boundary between where the student can perform adequately and where he has
trouble. Lord's (1974) discussion of "tailored" testing provides a rationale
for the unidimensional situation; the multidimensional case remains to be
developed, to the best of our knowledge.
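
The idea of locating a student in the cells of a factorial design can be sketched briefly. The Python fragment below is an illustration only: Figure 3 is not reproduced here, so the level names are assumptions drawn loosely from the description above, and the student record is hypothetical.

```python
from itertools import product

# Levels loosely based on the text; the exact levels of Figure 3 are
# assumptions for illustration.
TASKS = ["oral reading", "silent reading, untimed", "silent reading, timed"]
QUESTION_MODES = ["direct question", "recognition", "production"]
MATERIALS = ["familiar topic", "unfamiliar topic"]

# Every combination of one level per dimension is one cell of the design.
cells = list(product(TASKS, QUESTION_MODES, MATERIALS))

# Hypothetical record for one student: True where performance is adequate.
# This student is assumed to have trouble only with timed silent reading
# of unfamiliar material.
adequate = {cell: not (cell[0] == "silent reading, timed"
                       and cell[2] == "unfamiliar topic")
            for cell in cells}

trouble_cells = [cell for cell in cells if not adequate[cell]]
for task, mode, material in trouble_cells:
    print(f"trouble: {task} / {mode} / {material}")
```

Listing the cells where performance breaks down is a crude version of tracing the boundary between adequate and inadequate performance.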
Entry Level Assessment 

We agree fully with Guszak's (1972) characterization of the good
diagnostic reading teacher as someone ". . . capable of making a sequence of
relatively simple determinations of a pupil's reading achievement level, his
achievement potential, and his prominent skills needs" (p. 22). For the teacher
to accomplish this task with any precision, especially when the individual
differences within the class are substantial, the teacher must make quick and
accurate determinations of the student's level of performance. Starting an
assessment in the right "neighborhood" is essential if time is to be used
wisely.

Where the teacher has continuing day-to-day knowledge of the student,
choosing the proper "entry point" for assessment may be fairly easy. But what
about the new student? The new subject matter? The first day of class?

Developing instruments to meet this need seems to us an interesting
challenge, and so we will report our experiences; we have little evidence on
the reliability of these procedures, though they spring from a well established
statistical framework (Wald, 1947).

Here is a systematic but flexible technique for rapidly classifying
students whose level of decoding, vocabulary, and comprehension is unknown and
may range anywhere from first to eighth grade (Calfee & Hoover, 1974). Choose
a few lists of words arranged by difficulty level, and say to the student,
"Here are some word lists I would like you to read. Which list, A, B, C, D,
or E, do you think you can read?" As soon as the student has pointed to the
list he thinks he can read, the teacher has a piece of useful information. If
the student's self-assessment agrees with his subsequent performance, he knows
realistically what he can do. If he performs two or three levels below his
estimate, he at least has a good self-concept.

The teacher then asks the student to read the list he has just pointed to.
If he has trouble with several words, the teacher asks him to try an easier list.
If he pronounces every word quickly and correctly, the teacher asks him to read
a harder list. The student will reach the limit of his skill within a few
minutes. A similar procedure is used to assess the level of understanding of
word meanings and of paragraph comprehension.
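
The up-and-down procedure just described can be sketched as a small routine. This is a minimal Python illustration, not the instrument itself: the function name, the list labels, and the `reads_fluently` callback (standing in for the teacher's judgment of whether a list was read quickly and correctly) are all hypothetical.

```python
def find_entry_level(lists, start, reads_fluently):
    """Return the hardest list the student reads fluently, or None.

    lists: list labels ordered from easiest to hardest.
    start: index of the list the student chose for himself.
    reads_fluently: function taking a list label and returning True when
        the student read that list quickly and correctly.
    """
    i = start
    if reads_fluently(lists[i]):
        # The chosen list was easy enough: move to harder lists until one fails.
        while i + 1 < len(lists) and reads_fluently(lists[i + 1]):
            i += 1
        return lists[i]
    # The chosen list was too hard: move to easier lists until one succeeds.
    while i > 0:
        i -= 1
        if reads_fluently(lists[i]):
            return lists[i]
    return None


# Hypothetical student who reads lists A through C fluently.
WORD_LISTS = ["A", "B", "C", "D", "E"]


def fluent_through_c(label):
    return label <= "C"


print(find_entry_level(WORD_LISTS, 4, fluent_through_c))  # overestimated himself
print(find_entry_level(WORD_LISTS, 0, fluent_through_c))  # underestimated himself
```

The gap between the self-chosen starting list and the returned list is itself the piece of information about self-assessment mentioned above.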

We have used a test built around this model for research activities, and
are pleased with the rich return from what is generally less than a
twenty-minute test session. But the point to be stressed here is the value of
this test for purposes of determining entry level to other tests (and to
instruction, of course). Precise assessment of a student's skills and
knowledge, if it is to be also efficient and not time consuming, requires a
quick screening to determine relative standing in different component areas of
reading.

Categories of Reading Skills 

Reading includes several areas of knowledge and skills, and any analytic
effort to assess reading must attempt a "first cut" of the collection into
reasonably digestible pieces. We have suggested elsewhere (Calfee, Drum, &
Arnold, in press) this list: decoding, vocabulary, grammar, transliteral
comprehension, and inferential comprehension.

Decoding is the translation from print to sound. It is not clear at what
point during the acquisition of reading that the student can best develop this
skill. Neither is it clear how decoding skills serve the advanced reader.
Yet a good deal of data exists to support the proposition that the reader of
English who can't look at new sets of words and decode them with fluency is
likely to have trouble acquiring mastery of other reading skills:

1. Not all reading programs do a good job of training students to decode.
Certain approaches are noticeably less effective in promoting the
acquisition of decoding skills (Barr, 1974; Chall, 1967).

2. Not all students learn decoding skills in the elementary grades. At
the end of the fifth grade many children still evidence lack of skill
in handling basic decoding skills (McDonald & Elias, 1975).

3. Substantial correlations are found between decoding skill and school
performance up through college (Venezky, 1974b).

The student also needs to be able to define words, to appreciate synonyms,
and to recognize common usage of a word. The science question in Figure 4
requires some understanding of the word orifice. The dictionary definition is a

But few words have a single meaning, and common words have many meanings.
Furthermore, even if the student were to internalize the dictionary, society and
individuals keep devising idiosyncratic meanings.

Insert Figure 4 About Here

Reading teachers realize that vocabulary development is vitally important
to success on academic tasks. Austin and Morrison (1963) reported that more
than 75 percent of the teachers in their sample spent "considerable" or
"moderate" time in vocabulary development. Rubin, Trismen, Wilder, and Yates
(1973) report comparable findings in their survey of teachers in compensatory
reading programs. Unfortunately, it is far from clear that the instructional
emphasis is accompanied by adequate assessment, sufficient to show not only
whether the student "knows" a word, but at what level, and with what degree of
fluency.

Some may find it quaint to include grammar as part of the reading process,
but it probably has as much place as comprehension skills. In both instances,
understanding requires the transfer of skills from oral language to a new
context, and the expansion and elaboration of those skills to meet the peculiar
demands of the written language (Olson, 1975). An important distinction also
exists between style and substance. Style refers to following the proper
convention: producing all the past and plural markers, using proper word order,
and the like. If the student is going to speak or write English "properly,"
he has to know the conventions and use them in the proper context. There are
also substantive matters in grammar. Sometimes meaning is disambiguated only
when the plural marker, the past marker, or some other morphological ending is
noted. If a particular word order has one meaning and a different word order
conveys a different meaning, a substantive difference in grammar is apparent.
"Bill told Jane to snitch the ice cream" has a different meaning from "Bill was
told by Jane to snitch the ice cream." The answer to "Who will be punished for
snitching the ice cream?" depends upon recognizing this difference. Many
children come to school with adequate knowledge of English syntax; others may
need some help. It is the task of instructional assessment to distinguish one
group from the other.

Comprehension is a complicated matter; it can be virtually synonymous with
thinking. Trying to analyze the process of comprehension is an interesting
challenge. We propose here two broad categories of comprehension tasks,
transliteral and inferential. Transliteral comprehension requires the student
to have meanings for the words, recognition of word order, and either direct or
analogical experience with the content, so he can extract and remember
information conveyed directly by the passage, information fairly close to the
surface. Some questions can be answered by using matching techniques, some by
prior experience without reading the passage, and some require an understanding
of key terms. Useful assessment procedures sort out the strategies used by
students to answer various types of questions.

There is a kind of comprehension that requires a broader and deeper
analysis of the textual information. For instance, consider this
"comprehension" question:

"Most of the women in the United States are
(a) plumbers, (b) citizens, (c) redheads, or (d) waitresses."

With no passage to read, how does the student select the right answer? The
task is only modestly related to reading, though it comes from an actual
comprehension test. The student unfamiliar with our culture might think that
"redheads" was right; "waitresses" makes sense if many of the women in his
experience have been waitresses. An advocate of the women's liberation movement
might choose "plumbers." The "correct" answer to the question actually seems
stilted and perhaps absurd. The student must rely on knowledge and experience
that goes beyond the question and looks at the demand of the task. The good
reader brings to bear on the topic what he knows, what he learns from the
passage, and what he can figure out about the tester's reasoning and
intentions. The teacher needs to know which of these is behind the "poor"
student's failure.

The teacher who wishes to "measure comprehension" should be prepared to
cover the full range of the student's skills; these include not only finding
facts and making simple inferences, but also solving the problem of when to do
one or the other. Moreover, the making of inferences is not only a logical
process. Many comprehension questions require a process of inference that is
more analogical than logical. This requirement seems altogether reasonable,
because life experiences are often based more on metaphor than logic. We make
comparisons with experience and fill in the missing parts of an event by
analogy rather than by Aristotelian inference.

The reason for the separation of reading into components like those listed
above is straightforward: methods of assessment and selection of instructional
treatment are distinctive for each component. If such is not the case, then
the division into components is a useless exercise. The methodology for
evaluating the hypothesis that these are independent components (and such a
hypothesis is inherent in the listing of the components, we believe) is also
straightforward (Calfee & Elman, in press), though only a smattering of research
exists currently. We realize that our "shopping list" is not the same as what
others might propose; indeed, with more thought and evidence we might want to
change it. But we see little point to continued argument about the
"fundamental components" in skilled reading and the acquisition of reading. Let
researchers move on to propose the systematic, comprehensive, and generalizable
research designs necessary to decide which of the many process models are
viable. Such research will have theoretical and practical payoff. In the
meantime, we might put a moratorium on models with more than 7 ± 2
information-processing stages; these tend to overload the capacity of the
reader to understand the model.

Task Requirements in Assessment

In examining these categories of reading skills, we also need to analyze
the task requirements for successful performance on a particular test within a
given category. Some task requirements are specific to a given area, but others
cut across all areas. For instance, the same basic situation may be presented
to the reader so that he must recognize the correct answer from a set of
alternatives, or must produce the correct answer from memory. The person's
skill may allow him to perform well on one form of the task and not on the
other. As Kintsch (1970, Ch. 5) notes, different performances under the two
task formats permit the researcher (or tester) to infer underlying processes.
Recognition of previously studied information suggests the information has been
stored adequately; recall suggests that it was stored in a retrievable format.

To find what a person "really" knows, the teacher must devise various
ways to tap that knowledge. As noted earlier, it is relatively easy to show
that the student cannot remember it under certain conditions. The most direct
way to assess a person's knowledge is to ask him a direct question. If he does
not give the answer, then a second, more probing question can follow. "Do you
think it's this?" Maybe the probe will trip the memory key so the student
responds with the correct answer.

Speed and accuracy comprise another important task dimension. Speed is
not always "good," but often it is. Automaticity in basic skills can be
especially critical (LaBerge & Samuels, 1974). For example, a few years ago we
worked with some researchers who were developing a reading series for
kindergartners. They had devised an algorithm for teaching children to decode.
First the student learned a few letter-sound correspondences, then he moved his
finger from one letter to another to blend the sounds: "b;" "b-a, ba;" "ba-t,
bat." Within a short time the kindergartners could decode a fairly substantial
set of words. Some students were much faster than others, of course. Some could
look at the word and say "bat" and others were still going "b-a-t, bat." Then
they were asked to read sentences for the first time. The task changed from
decoding one word at a time at a relatively easy pace to decoding a whole string
of them. Furthermore, the children were expected to answer questions when they
finished the sentence. A few seemed to become "instantly dyslexic" at this
juncture in the program. In our opinion, this resulted from differences in
speed of decoding. Speed of reading single words was not important per se.
But it took so long for some students to translate the sentence word by word
that by the time they reached the end of the sentence they had forgotten the
beginning. Since the decoding strategy didn't work, these students began to
guess from initial letters, or they looked at the pictures, searching for
meaning with little regard for the print: strategies typical of poor readers.

What is the import of the speed-accuracy distinction for the classroom
teacher? Formerly, teachers were encouraged to test for both speed and power.
Today, in the era of behavioral objectives and mastery learning, the
distinction is largely overlooked. The student who is correct on 80 percent of
the items on a multiple-choice test has "mastered" the objective, without regard
to how quickly and easily he performs the task, and without regard to how he
might perform under different conditions and different demands (e.g., Block,
1974). If the objective is fundamental to the learning of another task, the
student may come to grief unless he is fluent with the first objective. In
this connection, some evidence has been cited in support of the relative
independence of speed of reading and accuracy of comprehension (Gates, 1921;
Singer, 1970). Unfortunately, our reading of the evidence leaves us far from
convinced about the actual degree of separability of these two measures.
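
The contrast between a mastery criterion based on accuracy alone and one that also asks for fluency can be sketched in a few lines. This is only an illustration under stated assumptions: the `Response` record, the 2-second median latency ceiling, and the two invented students are hypothetical and do not come from any of the studies cited; only the 80 percent figure is taken from the text.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Response:
    correct: bool   # was the item answered correctly?
    seconds: float  # time the student took to respond


def accuracy(responses: List[Response]) -> float:
    """Proportion of items answered correctly."""
    return sum(r.correct for r in responses) / len(responses)


def mastered_by_accuracy(responses: List[Response],
                         criterion: float = 0.80) -> bool:
    """Mastery as usually defined: percent correct only."""
    return accuracy(responses) >= criterion


def mastered_with_fluency(responses: List[Response],
                          criterion: float = 0.80,
                          max_median_seconds: float = 2.0) -> bool:
    """Mastery requiring both accuracy and speed.

    The latency ceiling is a hypothetical value chosen for illustration.
    """
    correct_times = [r.seconds for r in responses if r.correct]
    if not correct_times:
        return False
    return (accuracy(responses) >= criterion
            and median(correct_times) <= max_median_seconds)


# Two invented students, each correct on 8 of 10 items.
fast = [Response(i < 8, 1.0) for i in range(10)]
slow = [Response(i < 8, 6.0) for i in range(10)]
```

Both invented students pass the accuracy-only criterion, but only the faster one passes when the latency ceiling is added, which is the distinction the 80 percent rule overlooks.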

Another point can be mentioned only in passing. Assessment is often most
meaningful when carried out in a training context (Calfee et al., 1971).
Short-term training may serve to clarify the task demands for the student. The
teacher can note questions and comments by the students as they perform the
task. In the State of California, at least one major assessment project includes
a pretest which the teacher is encouraged to give to students until they are
thoroughly familiar with how to take the test. Certain commercial tests (e.g.,
Stanford Achievement Test Battery) also include short practice tests to
familiarize the students with the format and type of content they can expect to
encounter. This seems a most sensible practice. More generally, the teacher's
assessment should aim to measure the student's response to the ongoing
instructional program.

Assessment of Transfer

Educators must aim to teach for transfer. Teaching students everything they
need to know is impossible. Acquiring knowledge that is transferable generally
requires that the student understand principles as well as basic facts.
Transfer sometimes happens automatically, but it is often advisable to teach the
principle, and then to check or assess whether the principle has actually been
acquired. Giving many examples of a principle allows students to have experience
with a variety of instances where the principle applies. This procedure means
that the teacher must be continually checking not only what students have
learned, but also whether they have attained the principles.

How does one assess the extent of transfer? By changing certain features
of the situation from those that existed during training, and seeing whether
performance remains stable. By choosing novel instances of a general principle
not part of training, and seeing whether the student can apply the principle.
By asking the student to state the principle and to supply novel instances
exemplifying the principle.

Silberman (1967) demonstrated some years ago the importance of assessment
of transfer in the evaluation of a beginning reading program. Teaching students
to read a list of words by rote is fairly easy; it may be dull for the teacher
and student, but it can be done. However, when Silberman tested for transfer
using a variation on the Esper paradigm, in which one portion of a set of
associations is learned and transfer is measured by testing other portions of
the system (Figure 5), he found that the students had learned what they were
taught, nothing more. Using the transfer measure as the standard for a good
training program, Silberman proceeded to modify the training program until it
worked: until the students learned not only what they were taught, but the
principles that allowed them to apply the knowledge in new situations.
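
A training-and-transfer matrix of this general kind can be sketched as follows. The sketch makes several assumptions that are not Silberman's: the onsets and rimes are hypothetical (some combinations are pseudowords), the diagonal cells are arbitrarily held out as transfer items, and the "rote learner" responses are invented to show the pattern described above.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (onset index, rime index)

ONSETS = ["m", "s", "p", "t"]     # hypothetical initial consonants
RIMES = ["at", "an", "ap", "ad"]  # hypothetical vowel-consonant endings


def build_matrix(onsets: List[str], rimes: List[str]) -> Dict[Cell, str]:
    """Cross every onset with every rime to form the full word matrix."""
    return {(i, j): onset + rime
            for i, onset in enumerate(onsets)
            for j, rime in enumerate(rimes)}


def split_cells(words: Dict[Cell, str]) -> Tuple[Set[Cell], Set[Cell]]:
    """Hold out the diagonal cells for transfer; train on the rest.

    Every onset and every rime still appears in training, so a student
    who has learned the correspondences could read the held-out words.
    """
    transfer = {cell for cell in words if cell[0] == cell[1]}
    trained = set(words) - transfer
    return trained, transfer


def proportion_correct(cells: Set[Cell], words: Dict[Cell, str],
                       responses: Dict[str, str]) -> float:
    """Share of the given cells whose word was read correctly."""
    hits = sum(1 for cell in cells
               if responses.get(words[cell]) == words[cell])
    return hits / len(cells)


words = build_matrix(ONSETS, RIMES)
trained, transfer = split_cells(words)

# Invented rote learner: reads every trained word, no held-out word.
rote_responses = {words[cell]: words[cell] for cell in trained}

trained_score = proportion_correct(trained, words, rote_responses)
transfer_score = proportion_correct(transfer, words, rote_responses)
```

Scoring the trained and held-out cells separately is what makes the result interpretable: the invented rote learner is perfect on the 12 trained words and fails all 4 transfer words, the "learned what they were taught, nothing more" outcome.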

Insert Figure 5 About Here 

Silberman tested transfer through the Esper paradigm. This is only one of
several paradigms developed by experimental psychologists to measure "what is
learned" in a deeper sense than simple rote associations (Calfee, 1975b, pp.
393-398; pp. 423-429; Calfee, 1975c; Martinson, in preparation). The advantage
of these paradigms is that they provide precise information about what elements
of original learning have and have not transferred to a new situation. This
precision is in contrast to the vague measures that are all too often used as
an index of "transfer" in reading research, where the criterion measure is
performance on the California Achievement Test, and the transfer measure is
performance on the Metropolitan Achievement Test. Whether one observes transfer
or not, the exact meaning of the results is uncertain.

Summary 

What can the researcher do to help the reading teacher with the task of
classroom assessment? In our opinion, this is an area of need that has scarcely
been touched. To be sure, many of the new movements in testing seem to have
the goal of improving classroom assessment. But the new tests seem quite like
the old in appearance and application. The teacher is told not to measure the
student's performance against the norms of grade level equivalent or percentile
rank. Rather, the teacher should use a criterion: the student must pass 80
percent of the items on a multiple-choice test. But are the items really
appropriate? What is the relevant domain for generalization? To what degree does
the multiple-choice task relate to other tasks? Why 80 percent; why not 50
percent or 100 percent? How reliable are the data for a particular decision? How
valid is the decision?

These are not esoteric questions. They are at the core of the issue of
whether it is worth the teacher's and student's time and effort to carry out the
assessment.

Conventional "norm-referenced" tests build upon a substantial and
well-developed theoretical base. With suitable modification, the same principles
can serve in the development of tests for in-class use. The empirical procedures
for certifying the adequacy of conventional tests are also well established.
Little more is needed for certifying in-class tests, save for the linking of
these tests to the instructional base. The norm-referenced test is
curriculum-free. The in-class test has to prove its usefulness for making
effective and efficient instructional decisions, and for assessing the direct
and indirect results of instruction flowing from such decisions.

Carrying out research within this framework will pose special challenges to
the behavioral scientist. It requires continuous assessment while the student
is engaged in instruction. Computer-assisted instruction solves some problems
of control over instruction, and for certain purposes this may be desirable. But
most students learn to read in classrooms with a teacher, and it is in this
context that we think the greatest payoff will be found. The costs are
substantial: the investigator must make himself welcome in the classroom to the
point of establishing a collaborative relation with the teacher. The
instructional materials and the instructional activities of the teacher need to
be monitored and in some instances brought under control. We believe that the
payoff can also be considerable: increased knowledge about the cognitive
processes that mediate the acquisition of reading skill, and the development of
practical assessment tools for more effective teaching of reading.

Table 1

Example of Student-Item Matrix
with Consistent Items
(0 = correct, 1 = error)

Table 2

Analysis of Variance of Student-Item Matrix,
Estimation of Variance Components,
and Calculation of Reliability

1. Analysis of variance summary table

   Source      df    MS      EMS
   Students     5   .600    $\sigma^2_{SI} + \sigma^2_S$
   Items        3   .433    $\sigma^2_{SI} + \sigma^2_I$
   SI          15   .113    $\sigma^2_{SI}$

2. Estimation of variance components

   $\sigma^2_S = MS(S) - MS(SI) = .487$
   $\sigma^2_{SI} = MS(SI) = .113$

3. Reliability of contribution of each item
   to individual differences in students'
   total score

   $\text{reliability} = \dfrac{\sigma^2_S}{\sigma^2_S + \sigma^2_{SI}} = \dfrac{.487}{.487 + .113} = .81$

Table 3

Example of Student-Item Matrix
with Less Consistent Items than those in Table 1,
Showing Analysis of Variance
and Estimation of Reliability

Student-Item Matrix

                    Items             Student
   Student     1    2    3    4     Total Score
      A        1    1    1    1          4
      B        0    1    1    1          3
      C        1    0    1    1          3
      D        1    0    0    0          1
      E        0    1    0    0          1
      F        0    0    0    0          0

   Item
   Totals      3    3    3    3
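
The analysis of variance behind a table of this kind can be reproduced directly from the student-item matrix. The following is a minimal sketch, assuming the variance components are estimated as in Table 2, that is, $\sigma^2_S = MS(S) - MS(SI)$ and $\sigma^2_{SI} = MS(SI)$; the function name `item_reliability` and the dictionary keys are illustrative choices, not part of the original analysis.

```python
from typing import Dict, List


def item_reliability(matrix: List[List[int]]) -> Dict[str, float]:
    """Students-by-items analysis of variance for a 0/1 score matrix.

    Rows are students and columns are items. Returns the three mean
    squares and the index sigma2_S / (sigma2_S + sigma2_SI), with
    sigma2_S = MS(S) - MS(SI) and sigma2_SI = MS(SI).
    """
    n_students = len(matrix)
    n_items = len(matrix[0])

    grand_total = sum(sum(row) for row in matrix)
    correction = grand_total ** 2 / (n_students * n_items)

    ss_total = sum(x * x for row in matrix for x in row) - correction
    ss_students = sum(sum(row) ** 2 for row in matrix) / n_items - correction
    item_totals = [sum(col) for col in zip(*matrix)]
    ss_items = sum(t * t for t in item_totals) / n_students - correction
    ss_si = ss_total - ss_students - ss_items  # student-by-item residual

    ms_students = ss_students / (n_students - 1)
    ms_items = ss_items / (n_items - 1)
    ms_si = ss_si / ((n_students - 1) * (n_items - 1))

    var_students = ms_students - ms_si
    var_si = ms_si
    return {
        "MS(S)": ms_students,
        "MS(I)": ms_items,
        "MS(SI)": ms_si,
        "reliability": var_students / (var_students + var_si),
    }


# Student-item matrix of Table 3 (students A-F, items 1-4).
TABLE_3 = [
    [1, 1, 1, 1],  # A
    [0, 1, 1, 1],  # B
    [1, 0, 1, 1],  # C
    [1, 0, 0, 0],  # D
    [0, 1, 0, 0],  # E
    [0, 0, 0, 0],  # F
]

result = item_reliability(TABLE_3)
```

Applied to the Table 3 matrix, these formulas give MS(S) = .600, MS(I) = 0, and MS(SI) = .200, so the index is .400/.600, or about .67, lower than the .81 reported in Table 2 even though the mean square for students is the same.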

Table 4

Student-Item Matrix with Items Grouped
According to Test Factor T

                                                       Subtest Totals
                        Items                 Total    Items    Items
Student    1   2   3   4   5   6   7   8      Score     1-4      5-8
   1       1   1   0   1   0   0   0   0        3        3        0
   2       0   1   0   0   1   1   1   1        5        1        4
   3       1   0   0   0   0   0   0   0        1        1        0
   4       1   1   1   1   0   1   1   1        7        4        3
   5       1   1   1   1   0   0   0   1        5        4        1
   6       1   0   0   0   1   1   1   1        5        1        4
   7       1   1   1   1   1   1   1   1        8        4        4
   8       1   0   0   0   0   0   0   0        1        1        0
   9       0   0   0   0   1   0   1   1        3        0        3
  10       1   1   1   1   0   0   1   0        5        4        1
  11       1   1   0   1   1   1   1   1        7        3        4
  12       1   1   1   1   1   0   1   1        7        4        3
  13       0   0   0   0   0   0   0   0        0        0        0
  14       0   0   1   0   1   1   1   1        5        1        4
  15       1   1   1   0   0   0   0   0        3        3        0
  16       1   1   1   1   0   1   0   0        5        4        1
  17       1   1   0   1   1   1   1   1        7        3        4
  18       0   0   0   0   0   0   0   1        1        0        1
  19       0   0   0   0   0   0   1   0        1        0        1
  20       0   0   0   0   0   1   1   1        3        0        3

Item
Total
Score     13  11   8   9   8   9  12  12

Table 5

Analysis of Variance
of Original Student-Item Matrix,
Estimation of Variance Components,
and Computation of Overall Reliability

1. Analysis of Variance

   Source     df    MS
   Student    19   .749
   Item        7   .196
   SI        133   .183

2. Estimation of Variance Components

   $\sigma^2_S = MS(S) - MS(SI) = .566$      $\sigma^2_{SI} = .183$

3. Reliability of Item Contribution
   to Total Score

   $\text{reliability} = \dfrac{\sigma^2_S}{\sigma^2_S + \sigma^2_{SI}} = \dfrac{.566}{.566 + .183} = .756$

Table 6

Analysis of Variance,
Estimation of Variance Components,
and Calculation of Reliability Indices
for Total and Subtest Scores

1. Analysis of Variance

   Source       df    MS      EMS
   Students     19   .749    $\sigma^2_{SI(T)} + \sigma^2_{ST} + \sigma^2_S$
   Tests         1   .0      $\sigma^2_{SI(T)} + \sigma^2_{ST} + \sigma^2_{I(T)} + \sigma^2_T$
   ST           19   .645    $\sigma^2_{SI(T)} + \sigma^2_{ST}$
   Items (T)     6   .229    $\sigma^2_{SI(T)} + \sigma^2_{I(T)}$
   SI(T)       114   .106    $\sigma^2_{SI(T)}$

2. Reliability of Subtest Contribution to Total Score

   $\sigma^2_S = MS(S) - MS(ST) = .104$      $\sigma^2_{ST} = MS(ST) - MS(SI(T)) = .539$

   $\text{reliability} = \dfrac{\sigma^2_S}{\sigma^2_S + \sigma^2_{ST}} = \dfrac{.104}{.104 + .539} = .162$

3. Reliability of Item-Within-Subtest Contribution to
   Subtest Scores

   $\sigma^2_{ST} = .539$      $\sigma^2_{SI(T)} = .106$

   $\text{reliability} = \dfrac{\sigma^2_{ST}}{\sigma^2_{ST} + \sigma^2_{SI(T)}} = \dfrac{.539}{.539 + .106} = .836$

Note: ___ is a random effect, ___ is a fixed effect
in the analysis of variance model.
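
The nested analysis can be checked against the student-item matrix of Table 4. The sketch below assumes that items 1-4 and items 5-8 form the two levels of test factor T and that the variance components are estimated by the same subtractions of mean squares used in the tables; the function name `nested_anova` and its dictionary keys are illustrative, not part of the original analysis.

```python
from typing import Dict, List

# Student-item matrix of Table 4 (20 students, items 1-8).
TABLE_4 = [
    [1, 1, 0, 1, 0, 0, 0, 0], [0, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 1], [1, 1, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 1],
]


def nested_anova(matrix: List[List[int]], items_per_subtest: int) -> Dict[str, float]:
    """Students x tests analysis with items nested within tests."""
    n_students = len(matrix)
    n_items = len(matrix[0])
    n_subtests = n_items // items_per_subtest
    k = items_per_subtest
    correction = sum(map(sum, matrix)) ** 2 / (n_students * n_items)

    # Each student's total on each subtest (the student-by-test cells).
    cells = [[sum(row[t * k:(t + 1) * k]) for t in range(n_subtests)]
             for row in matrix]
    item_totals = [sum(col) for col in zip(*matrix)]
    test_totals = [sum(col) for col in zip(*cells)]

    ss_total = sum(x * x for row in matrix for x in row) - correction
    ss_s = sum(sum(row) ** 2 for row in matrix) / n_items - correction
    ss_t = sum(t * t for t in test_totals) / (n_students * k) - correction
    ss_cells = sum(c * c for row in cells for c in row) / k - correction
    ss_st = ss_cells - ss_s - ss_t
    ss_i_in_t = sum(t * t for t in item_totals) / n_students - correction - ss_t
    ss_si_in_t = ss_total - ss_cells - ss_i_in_t

    df_s, df_t = n_students - 1, n_subtests - 1
    df_i_in_t = n_subtests * (k - 1)
    out = {
        "S": ss_s / df_s,
        "T": ss_t / df_t,
        "ST": ss_st / (df_s * df_t),
        "I(T)": ss_i_in_t / df_i_in_t,
        "SI(T)": ss_si_in_t / (df_s * df_i_in_t),
    }
    var_s = out["S"] - out["ST"]
    var_st = out["ST"] - out["SI(T)"]
    out["subtest_reliability"] = var_s / (var_s + var_st)
    out["item_reliability"] = var_st / (var_st + out["SI(T)"])
    return out


result = nested_anova(TABLE_4, items_per_subtest=4)
```

Run on the Table 4 matrix, the mean squares agree with Table 6 to three decimals. The two indices agree to within rounding: working from unrounded mean squares gives about .162 and .835, against the .162 and .836 obtained in the table from rounded values.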

Figure Captions

Figure 1. Frequency distribution of kindergarten alphabet scores for total
sample, for students above, and for students below grade level in reading
achievement at end of first grade (Calfee, in press).

Figure 2. Cut-point results based on kindergarten alphabet scores and first
grade reading achievement (Calfee, in press).

Figure 3. A factorial structure on dimensions of reading for instruction
and assessment.

Figure 4. Sample science test item with illustration.

Figure 5. Illustration of training and transfer matrix used by Silberman
(1967) for assessment of decoding principles in beginning reading
curriculum.

[Figure 3. Factorial structure on dimensions of reading. Labels recoverable
from the figure: Task, Materials, Familiar Topic, Unfamiliar Topic, Reading
Mode, Question, Literal, Easy Vocabulary, Difficult Vocabulary.]

VOCABULARY = DEFINING, KNOWING SYNONYMS,
RECOGNIZING USAGE

What is an Orifice?

"a mouth or similar opening;
a hole; an aperture"

What is the Orifice in this Picture?

[Illustration: typical gas burner, with the labels "Orifice" and "Air inlet"]

Figure 4. Sample science test item with illustration

Figure 5. Illustration of training and transfer
matrix used by Silberman (1967) for assessment
of decoding principles in beginning
reading curriculum

Footnotes

1. Domain-referenced testing probably comes closest in spirit to the
conceptualization that seems most useful to us. Theory and practice remain to be
established for domain-referenced tests, though some interesting beginnings
exist (Hively, 1974, especially chapters by Millman, Miller, and Nitko; Knapp,
1968).

2. McCullough (1937) has presented evidence for independence of comprehension
processes in the form of low to moderate correlations between elementary
students' responses to comprehension questions about details, main idea,
sequence, and creative reading. Unfortunately, the number of items was small,
and test reliabilities were not reported. Thus, the modest size of the
correlations is not strong evidence of independence, though the data are
suggestive.

3. Holland (1975) has given thought to desirable characteristics of tests for
instructional decision-making, and presents some interesting indices:

(a) What proportion of the instructional time is used by testing versus
teaching?

(b) Does the test provide useful information for sorting students into
instructional groups? If the test results say "assign all students to
instruction A," the test has served no useful role for making a decision.

(c) Does the test promote valid decisions? Does the student who passes the
test succeed without instruction, and contrariwise?

Holland's methods of analysis are fairly crude, but it seems to us that the
questions are right. His conclusions about the usefulness of several
instructional systems are generally disappointing, but seem to us based on too
little data and too superficial an analysis.

Footnotes (continued)

Laboratory research from several sources demonstrates the important relation
between fluent skilled decoding and comprehension (e.g., Perfetti & Hogaboam,
1975a, 1975b; Cromer, 1970; following the analysis in Calfee, Arnold, & Drum,
1976). To be sure, the training studies needed to establish causality remain
to be done. It is far from clear that the teaching of decoding skills in regular
classrooms receives the emphasis that some reports suggest. For instance, in
questioning teachers whose classrooms included some kind of compensatory reading
program, it was found that less than one in five teachers of sixth grade students
made any extensive use of phonics curriculum programs (Rubin, Trismen, Wilder, &
Yates, 1973). More than 95 percent of the teachers at all grade levels said
that comprehension was a major goal. Another piece of information from this
study bears on the relative emphasis on decoding skills: In second grade,
75 percent of the teachers report that each child reads aloud to an adult once a
week or more often. By fourth grade, only 63 percent of the teachers report this
much oral reading, and by sixth grade the figure is 57 percent.

OPEN DISCUSSION OF CALFEE PRESENTATION

B. SMITH: Bob, I missed something. You said that you don't have to worry about
the reliability of the individual test, but you do have to worry about the
reliability over a set of administration errors. How are you going to get
reliability over a set, if you haven't got reliability in any of the elements in
the set?

CALFEE: That's a technical question, and one of these days I will write a
technical answer to it, but basically the answer is going to take this form:
Look at a complex factorial test structure; time can be one of the dimensions,
as can production and recognition. Imagine a test, materials that may have 40 or
50 items in it, where you maybe have only two or three or four items in a single
cell. A way of measuring a reliability within a cell, which is where you ought
to be measuring it, is to compute the mean square residual error after you have
extracted all of the systematic variance.

As Cronbach points out in his analysis of reliability, whatever is left over
is a measure of the reliability of that test. So indeed the technical knowledge
for answering that question exists, but that's not part of what we want to say to
the classroom teacher. Another part of that answer goes back to the extremely
reliable test, where the pattern is either all successes or all errors. If you
design a clean test that is aimed at the specific skill, you are very often going
to get performance that looks very much like that, so it becomes, manifestly,
within the cell, reliable.

We have developed tests that fall within a cell, that consist of six items.
We ask the student to pronounce several words when we are looking for performance
of a particular character. We are not asking: "Did you get the word right or
wrong?" That requires several skills in itself. Instead, we are asking: "How
did you handle the 'ou' in about?" We score just that, right or wrong.

We give the first item to the student, and if he or she makes a mistake, we
say, "Gee, did you really understand what we are talking about? Because the
right answer to that is this, and this is why." Then we try the second item. If
he or she misses that, we stop the testing right there. That's all of the
evidence we want that they don't know how to handle that test.

If they get one or the other of those right, we give them four more items.
We find that they either are right on three or four of them. A small number get
three; maybe 1% or 2% or 5% will get a couple right, a couple wrong; a small
number get one, and only one, right.

You get mostly a pattern where they get them all right or all wrong. The 
reliability problem can be treated in the most trivial way. 
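
The six-item procedure just described can be written out as a short routine. This is a sketch under assumptions: it reads the stopping rule as "stop only when both of the first two items are missed," and the function name, the callbacks, and every word in the item list other than "about" are hypothetical.

```python
from typing import Callable, List


def administer_cell_test(
    items: List[str],
    is_correct: Callable[[str], bool],
    explain: Callable[[str], None],
) -> List[bool]:
    """Give a six-item test for one cell of the design.

    Item 1 is given first; a miss is followed by an explanation of the
    right answer. Item 2 is then given. If both are missed, testing
    stops. Otherwise the remaining four items are given.
    """
    if len(items) != 6:
        raise ValueError("this sketch assumes a six-item cell")

    first = is_correct(items[0])
    if not first:
        explain(items[0])  # tell the student the answer and why
    second = is_correct(items[1])

    results = [first, second]
    if not first and not second:
        return results     # enough evidence; stop here
    results.extend(is_correct(item) for item in items[2:])
    return results


# Hypothetical item set targeting the 'ou' correspondence.
OU_WORDS = ["about", "out", "loud", "found", "shout", "cloud"]

explained: List[str] = []
nonreader = administer_cell_test(OU_WORDS, lambda word: False, explained.append)
reader = administer_cell_test(OU_WORDS, lambda word: True, explained.append)
```

With these two invented students the routine returns two misses and an early stop for the first, and six successes for the second, the all-wrong and all-right patterns described above.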

If you really wanted, you could probably begin to pick me apart on some of
the details. Making the system really work is going to take more than the hand
waving. But the technical background is available in Cronbach's theory of
generalizability. For the experimental psychologists who are not aware of that,
let me say it is a fundamentally important work that is going to change your
concepts of reliability greatly over the next decade.

CAZDEN: What is the reference to that?

CALFEE: Cronbach, Gleser, Nanda, and Rajaratnam, The dependability of behavioral
measurements: Theory of generalizability.

BLOCK: How do you see the interface between the outcomes of testing and what the
teacher can do instructionally for a given child? It's very nice to have
tailored assessment devices, but if we can't differentiate descriptions as a
function of those decisions, what good are they? How do you see that interface
working out?

CALFEE: When I think about it seriously, I say that first we have to divide what
we mean by reading into a small number of coherent areas. Probably the research
ought to aim at one of those at a time, and if it were up to me, I would try to
answer that question for decoding. I believe you can work on the answer to the
question that you have asked by looking at decoding as a separate problem,
independent of the other areas of reading. If we do get a model for answering
that question for this one area, we would be in much better shape to know how to
solve it for comprehension and for vocabulary development.

If I am wrong in the way I am carving up reading, or if I am wrong in the
basic assumption that reading is a bunch of separable skills, the research isn't
going to turn up anything very interesting. But what I would want to do is say,
"Okay, take decoding, let's carve decoding up into a small number of coherent
areas. Let's ask, what are the major dimensions of curriculum development?" They
are what will become the segments of the curriculum. If you are really
interested, I will send you a paper.

BLOCK: I really am, because I always find it difficult.

CALFEE: What is the means of delivery, because you can teach the same thing in
many ways? What are the factors, and what are some reasonable levels of those
factors? There is a lot of thinking about the curriculum before you do anything.
Now, let's think about teachers. The teacher is not a homogeneous entity, far
from it. Teachers come in a variety of forms, and I would hate to do research on
teachers any more without including teacher training as part of the design. So
dimensions of teachers and dimensions of teacher training programs are important.

Then I would say, "Let me begin a design process. Let me try to get a
design that might use 20 or 30 teachers, over the course of a year, in a fairly
well controlled, but natural, situation. And let's collect data consistently."
What I am talking about is do-able. You have to have good political relations
with teachers and teacher units. I collect the data, and I look at it for a
year, and then I know how to do the next experiment. The outcome would be a
validation of certain training programs for teachers, appropriate for certain
kinds of classrooms and students, with answers about where the important
curriculum decisions have to be made. We are not going to come up with a
curriculum, but you are going to know how to use the chunks of curriculum you
have.

What Cronbach points out is something that Herb Clark has also pointed out:
There are fixed effects in this business, but there are also random effects, and
you can control both of those.

GORDON: I found myself in enthusiastic agreement with most of the points you
were making, but when you came to your summary, you, at least by implication,
introduced a contradiction. I think you suggest that standardized testing is a
terribly useful and dependable device, so that you didn't have much argument with
it. I think that's true. But if we take seriously the things that you are
talking about and begin to achieve these, we are going to change the conditions
of learning. That introduction to your summary ought to indicate that under
traditional conditions, under unchanged conditions, or in the absence of success
at the things you are talking about, that prediction holds. If you succeed in
the things you are talking about, we are going to change the validity of those
predictions.

CALFEE: That's right. It is not a conclusion that I am unaware of. Let me
again refer to my own teaching. The poor students in my classes are getting
tested every week. I have tutors, who are assigned to help people, and there are
fixed standards, so I have a good standardized testing procedure. There are
exams, big exams, that ask for mastery of statistical concepts in a global way.
The student has got to "get it all together." The standards are fixed, unlike
standardized tests. They are a rat race, a treadmill that gets faster as the
norms go up. I don't use that approach, and I think that's the answer to the
contradiction that you referred to. Following the national assessment model, if
everybody did perfectly on what seemed to us to be a reasonable set of general
items, then who cares about norms any more? That would radicalize the testing
business.

RESNICK: And the education business.

CALFEE: And the education business. I face that in my class by setting absolute
standards. If you get 90% of all of the points on all of the exams, you get an
A. And they are tough exams. I can readily get evidence on that from the
students. Something like 50% of the students in my courses get an A, and it's
not because I am grading easy. They learn.

RESNICK: That general issue is a good one to close on; it leads to some radical
and hopeful thoughts for the future.

END SESSION